Getting Started with TTS
What is Text-to-Speech (TTS)?
Text-to-Speech (TTS) is a technology that converts written text into spoken audio. It allows computers to "speak" words out loud using synthetic or recorded voices. In the context of SkyrimNet, TTS gives NPCs the ability to talk dynamically: not just with pre-recorded lines, but with sentences generated on the fly by the AI.
How It Works
TTS works in two broad steps: first, the system breaks the written sentence down into sounds (called phonemes); then a voice model produces speech audio from those sounds. SkyrimNet uses modern neural TTS systems (such as XTTS or Zonos) to make the voices sound natural and emotional, as if a real person were speaking. These voices can be customized to sound robotic, dramatic, or calm, or even to imitate specific characters.
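The two-step shape described above can be sketched in a few lines of Python. This is a deliberately toy illustration, not how XTTS, Zonos, or SkyrimNet work internally: the `TOY_PHONEMES` table, the function names, and the sine-tone "voice" are all hypothetical stand-ins chosen only to show where the text-to-phoneme stage ends and the phoneme-to-audio stage begins.

```python
import math
import struct

# Hypothetical, hand-written lookup table for illustration only.
# A real front end would use a pronunciation lexicon or a learned model.
TOY_PHONEMES = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

SAMPLE_RATE = 22050  # samples per second (assumed for this sketch)


def text_to_phonemes(text):
    """Stage 1: split the text into words and look up each word's phonemes."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        # Unknown words map to a placeholder instead of raising an error.
        phonemes.extend(TOY_PHONEMES.get(word, ["?"]))
    return phonemes


def phonemes_to_audio(phonemes, ms_per_phoneme=100):
    """Stage 2 stand-in: emit one short sine tone per phoneme as 16-bit mono PCM.

    A neural voice model would go here; the tones only mark where audio
    would be produced, and their pitch is arbitrary.
    """
    samples_per_phoneme = SAMPLE_RATE * ms_per_phoneme // 1000
    frames = bytearray()
    for index, _ in enumerate(phonemes):
        freq = 220.0 + 40.0 * index  # arbitrary pitch, varies per position
        for n in range(samples_per_phoneme):
            value = int(12000 * math.sin(2 * math.pi * freq * n / SAMPLE_RATE))
            frames += struct.pack("<h", value)  # little-endian signed 16-bit
    return bytes(frames)
```

Calling `phonemes_to_audio(text_to_phonemes("Hello, world!"))` returns raw PCM bytes. In a real engine, both stages are replaced by trained models, and the boundary between them is often less sharp than this sketch suggests.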
Why It Matters in SkyrimNet
With TTS, SkyrimNet can give life to AI-powered NPCs who speak new, personalized lines every time they interact with you. Conversations are no longer limited to pre-written dialogue: NPCs can comment on your actions, remember past events, or express emotion, all in their own voice, without requiring voice actors or modding tools like the Creation Kit.
SkyrimNet TTS Engine Comparison
SkyrimNet supports multiple TTS backends, each with its own strengths in quality, speed, and customization. Here is a side-by-side comparison of Zonos, XTTS, and Piper to help you choose the best engine for your system.
Feature Comparison
Feature | Zonos | XTTS (Default) | Piper |
---|---|---|---|
Voice Quality | Studio-grade, cinematic | Very high, expressive | Good, clean, lightweight |
Voice Cloning | Yes, identity cloning | Yes, from voice sample | No cloning |
Emotional Control | Planned | Basic support (tone hints) | None |
Accent/Language Support | Wide | Cross-lingual | Limited |
Speed | Slower (heavier inference) | Moderate (~1-2 s latency) | Instant (~100-200 ms) |
Local Integration | Local HTTP endpoint | Local HTTP endpoint | In-process (no server) |
Output Format | WAV / PCM | WAV / PCM | PCM (16-bit mono, 22050 Hz) |
Best Use | Followers, high-end systems | General dialogue, dynamic LLM | Background NPCs, fast chatter |
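The Output Format row lists raw PCM (16-bit mono, 22050 Hz) for Piper. Raw PCM carries no header, so a player has to be told those parameters up front. As a minimal sketch, the Python standard-library `wave` module can wrap such a buffer in a WAV container. The function and constant names here are illustrative and are not part of SkyrimNet or Piper, and the sketch assumes the bytes already match the format given in the table.

```python
import io
import wave

# Parameters taken from the comparison table (Piper row).
PIPER_SAMPLE_RATE = 22050  # Hz
PIPER_SAMPLE_WIDTH = 2     # bytes per sample (16-bit)
PIPER_CHANNELS = 1         # mono


def pcm_to_wav_bytes(pcm):
    """Wrap raw 16-bit mono PCM in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(PIPER_CHANNELS)
        wav_file.setsampwidth(PIPER_SAMPLE_WIDTH)
        wav_file.setframerate(PIPER_SAMPLE_RATE)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


def pcm_duration_seconds(pcm):
    """Length of the clip in seconds, derived from the byte count alone."""
    bytes_per_second = PIPER_SAMPLE_RATE * PIPER_SAMPLE_WIDTH * PIPER_CHANNELS
    return len(pcm) / bytes_per_second
```

At this format, one second of speech is 22050 samples × 2 bytes = 44100 bytes, which gives a quick sanity check on any buffer that is supposed to be Piper-style PCM.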
Resource Usage (Approximate)
Engine | CPU Usage | VRAM Usage | Load Time | Notes |
---|---|---|---|---|
Zonos | High | High (approx. 6 GB) | ~1-3 s | Large models, best for key scenes |
XTTS | Moderate | Moderate (approx. 3 GB) | ~1-2 s | Real-time feasible, very flexible |
Piper | Low | None (CPU only) | Instant | Fastest, most efficient TTS |
Note: Resource usage depends on the hardware and the specific model used. GPU acceleration significantly improves both Zonos and XTTS; on high-end systems their load times can be close to instant.
Summary
Engine | Strengths | Tradeoffs |
---|---|---|
Zonos | Cinematic quality, voice cloning, emotional nuance | Slower, heavier; ideal for premium content |
XTTS | Great balance of quality, cloning, and speed | Slight delay; occasional voice drift |
Piper | Extremely fast and lightweight for real-time interaction | No cloning or advanced voice features |
Choosing the Right Engine
Scenario | Recommended Engine |
---|---|
Voiced main quest with drama/emotion | Zonos |
Companion with a personalized voice | XTTS |
Fast ambient barks / guards / vendors | Piper |
Fully dynamic AI-driven conversations | XTTS |
Low-end PC / limited available VRAM | Piper |
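The recommendations above, combined with the approximate VRAM figures from the resource table, can be expressed as a small decision helper. This is only a sketch for reasoning about the tradeoff: the scenario keys, the function name, and the fallback order are assumptions made for illustration, and nothing here corresponds to an actual SkyrimNet setting or API.

```python
# Approximate VRAM needs in GB, taken from the resource usage table.
APPROX_VRAM_GB = {"zonos": 6.0, "xtts": 3.0, "piper": 0.0}

# Hypothetical scenario keys mirroring the rows of the table above.
SCENARIO_ENGINE = {
    "main_quest_drama": "zonos",
    "personalized_companion": "xtts",
    "ambient_barks": "piper",
    "dynamic_conversation": "xtts",
    "low_end_pc": "piper",
}

# Heaviest to lightest; used to step down when VRAM is short (an assumption).
FALLBACK_ORDER = ["zonos", "xtts", "piper"]


def recommend_engine(scenario, free_vram_gb):
    """Return the table's pick, stepping down to a lighter engine if VRAM is short."""
    # Unknown scenarios fall back to XTTS, the default engine.
    preferred = SCENARIO_ENGINE.get(scenario, "xtts")
    start = FALLBACK_ORDER.index(preferred)
    for engine in FALLBACK_ORDER[start:]:
        if APPROX_VRAM_GB[engine] <= free_vram_gb:
            return engine
    # Piper runs on the CPU, so it is always a valid last resort.
    return "piper"
```

For example, `recommend_engine("main_quest_drama", 4.0)` steps down from Zonos to XTTS because roughly 6 GB is not available, while `recommend_engine("ambient_barks", 24.0)` still returns Piper, since the helper never upgrades past the table's recommendation.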
TL;DR
- Zonos = Premium, cinematic, cloned voices with deep expression
- XTTS = Default engine with cloning and great all-around quality
- Piper = Fastest engine, perfect for lightweight real-time voice playback
The plan is for all three engines to be mixed and matched per actor or event within SkyrimNet for optimal performance and immersion. Note: this is not yet supported as of beta4.